
Protein Engineering, Design and Selection

Oxford University Press (OUP)

Preprints posted in the last 30 days, ranked by how well they match Protein Engineering, Design and Selection's content profile, based on 14 papers previously published here. The average preprint has a 0.00% match score for this journal, so anything above that is already an above-average fit.

1
CombinGym: a benchmark platform for machine learning-assisted design of combinatorial protein variants

Chen, Y.; Fu, L.; Lu, X.; Li, W.; Gao, Y.; Wang, Y.; Ruan, Z.; Si, T.

2026-03-25 synthetic biology 10.64898/2026.03.24.714074 medRxiv
Top 0.1%
3.8%

Combinatorial mutagenesis is essential for exploring protein sequence-function landscapes in engineering applications. However, while large-scale machine learning benchmarks exist for protein function prediction, they are primarily limited to single-mutant libraries, leaving a critical gap for combinatorial mutagenesis. Here we introduce CombinGym, a benchmarking platform featuring 14 curated combinatorial mutagenesis datasets spanning 9 proteins with diverse functional properties including binding affinity, fluorescence, and enzymatic activities. We evaluated nine machine learning algorithms from five methodological categories (alignment-based, protein language, structure-based, sequence-label, and substitution-based) across multiple prediction tasks, assessing both zero-shot and supervised learning performance using Spearman's ρ and Normalized Discounted Cumulative Gain metrics. Our analysis reveals the substantial impact of measurement noise and data processing strategies on model performance. By implementing hierarchical dataset splits (0-vs-rest, 1-vs-rest, 2-vs-rest, and 3-vs-rest scenarios), we demonstrate the value of lower-order mutation data for empowering machine learning models to predict higher-order mutant properties. We validated this capacity through both in silico simulation (improving fluorescence brightness of an oxygen-independent fluorescent protein) and experimental validation (engineering enzyme substrate specificity), achieving a substantial increase in specific activity. All datasets, benchmarks, and metrics are available through an interactive website (https://www.combingym.org), facilitating collaborative dataset expansion and model development through integration with automated biofoundry platforms.
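
For readers unfamiliar with the two metrics named in this abstract, the following is a minimal sketch of how a rank correlation and a Normalized Discounted Cumulative Gain can be computed for predicted versus measured variant fitness. It is not CombinGym's implementation: the function names (`average_ranks`, `spearman_rho`, `ndcg`) are hypothetical, the NDCG uses linear gains with a log2 discount (one common convention among several), and measured values are assumed to be non-negative and not all identical.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(measured, predicted):
    """Pearson correlation of the rank vectors (assumes non-constant inputs)."""
    rx, ry = average_ranks(measured), average_ranks(predicted)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def ndcg(measured, predicted):
    """DCG of the predicted ordering divided by DCG of the ideal ordering."""
    def dcg(gains):
        return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))
    by_prediction = [m for _, m in sorted(zip(predicted, measured),
                                          key=lambda pair: -pair[0])]
    ideal = sorted(measured, reverse=True)
    return dcg(by_prediction) / dcg(ideal)
```

Spearman's ρ weighs every variant equally, whereas NDCG rewards placing the best variants near the top of the ranking, which is often the quantity of interest when only a few designs can be tested.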

2
Benchmarking and Experimental Validation of Machine Learning Strategies for Enzyme Engineering

Zeng, Z.; Jin, J.; Xu, R.; Luo, X.

2026-03-30 bioengineering 10.64898/2026.03.29.715152 medRxiv
Top 0.1%
2.6%

Enzyme-directed evolution increasingly relies on computational tools to prioritize mutations, yet their practical value is difficult to assess because kinetic data are often aggregated across heterogeneous assay conditions, inflating apparent generalization. Here we introduce EnzyArena, a curated benchmark that groups kinetic parameters (kcat, Km, kcat/Km) into condition-matched experimental subsets to enable realistic evaluation. Using this resource, we benchmark 10 representative models from two arising strategy families--zero-shot fitness prediction and supervised kinetic-parameter prediction--across BRENDA- and SABIO-RK-derived subsets and 25 independent mutagenesis datasets. Kinetic-parameter predictors perform strongly on database-derived subsets but lose their advantage on independent datasets, whereas zero-shot predictors show more consistent generalization. A simple consensus of multiple zero-shot models further improves the precision of identifying beneficial mutants. We prospectively validated these findings in a wet-lab campaign (150 mutants) comparing random mutants, UniKP-prioritized mutants and ESM-1v-prioritized mutants (representing supervised kinetic-parameter prediction and zero-shot fitness prediction, respectively), where ESM-1v achieved the highest utility and UniKP underperformed the random baseline. Together, this study establishes realistic baselines for computational mutant prioritization and highlights consensus zero-shot strategies as a practical starting point for enzyme engineering.
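
The abstract does not say how the consensus of zero-shot models is formed. As one plausible illustration only, the sketch below averages each mutant's rank across models and orders mutants by that mean rank. The function name `consensus_order`, the mutant labels, and the scores are all hypothetical, and higher scores are assumed to mean higher predicted fitness.

```python
def consensus_order(scores_by_model):
    """Order mutants by mean rank across models (rank 1 = best in a model)."""
    mutants = list(next(iter(scores_by_model.values())))
    n_models = len(scores_by_model)
    mean_rank = {m: 0.0 for m in mutants}
    for scores in scores_by_model.values():
        # Rank mutants within this model, highest score first.
        ordered = sorted(mutants, key=lambda m: -scores[m])
        for position, mutant in enumerate(ordered, start=1):
            mean_rank[mutant] += position / n_models
    return sorted(mutants, key=lambda m: mean_rank[m])

# Hypothetical scores from three zero-shot models for three mutants.
example = {
    "model_a": {"A10G": 0.5, "L20P": -1.0, "K30R": 0.2},
    "model_b": {"A10G": 0.1, "L20P": -0.5, "K30R": 0.3},
    "model_c": {"A10G": 0.9, "L20P": -2.0, "K30R": 0.0},
}
ranking = consensus_order(example)
```

Rank averaging sidesteps the fact that different models score on different scales; a majority vote over per-model thresholds would be an equally reasonable alternative.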

3
Library docking for Cannabinoid-2 Receptor ligands

Rachman, M. M.; Iliopoulos-Tsoutsouvas, C.; Dominic Sacco, M.; Xu, X.; Wu, C.-G.; Santos, E.; Glenn, I. S.; Paris, L.; Cahill, M. K.; Ganapathy, S.; Tummino, T. A.; Moroz, Y. S.; Radchenko, D. S.; Okorie, M.; Tawfik, V. L.; Irwin, J. J.; Makriyannis, A.; Skiniotis, G.; Shoichet, B. K.

2026-03-21 biochemistry 10.64898/2026.03.19.713017 medRxiv
Top 0.1%
1.5%

Cannabinoid receptors are therapeutically promising GPCRs that are also interesting test systems for structure-based methods, which have targeted them previously. Here we used the CB2 receptor as a template to explore several topical questions in library docking. Whereas an earlier campaign against the CB1 receptor led to potent but relatively non-selective ligands, here we found that targeting interactions with polar, orthosteric site residues led to subtype-selective ligands. Docking hit rate and especially hit affinity improved in moving from a 7 million to a 2.6 billion molecule library. Similar to earlier studies, docking against active and inactive states of the receptor did not reliably bias toward the discovery of agonists or inverse agonists. Cryo-EM structures of two of the new agonists, each in a different chemotype, superposed well on the docking predictions. Correspondingly, structure-based optimization led to 10- to 140-fold improvements within three different series, also consistent with well-behaved ligand families. Hit rates with a fully enumerated 2.6 billion molecule library resembled those of an implied 11 billion molecule library from a building-block method, consistent with the latter's ability to explore this space, though higher affinities were discovered from the fully enumerated set. Overall, eight diverse families of ligands, with potencies <100 nM and mostly unrelated to previously known ligands, were found. Implications for future studies are considered.

4
Surface Display For Phage Assisted Continuous Evolution: A Platform For Evolving / Screening Nanobodies In Prokaryote Systems

Flores-Mora, F. E.; Brodsky, J.; Cerna, G. M.; Tse, A.; Hoover, R. L.; Bartelle, B. B.

2026-04-04 synthetic biology 10.64898/2026.04.03.716437 medRxiv
Top 0.1%
1.5%

Despite >50 years of methods development, specific antibodies are still generated at low throughput and remain in high demand across biotechnology. Most biologics and immunoprobes are monoclonal antibodies, developed using a combination of inoculating animals with a target antigen, engineered candidate libraries, and multiple rounds of selection using phage or yeast display. Here we introduce a synthetic biology scheme to eliminate the need for nearly all of these steps, by combining Surface display on E. coli and Phage display with the microvirus ΦX174, Assisting Continuous Evolution (SurPhACE). Instead of building libraries for screening, SurPhACE runs a closed evolutionary program. A typical experiment can have 10^11 mutant candidates under active selection, with complete turnover of the mutant population every 30 min, or >5×10^12 unique mutants per day, using less than 100 mL of bacterial culture media. We demonstrate SurPhACE for optimizing a nanobody to a related epitope, and develop novel nanobodies for an arbitrary target using a minimal starting library to establish a proof of concept and identify best practices for this scalable method for generating protein binders.

5
GROQ-seq Enables Cross-site Reproducibility for High-Throughput Measurement of Protein Function

Spinner, A.; Ross, D.; Cortade, D.; Ikonomova, S.; Baranowski, C.; Dhroso, A.; Reider Apel, A.; Sheldon, K.; Duquette, C.; Kelly, P. J.; DeBenedictis, E.; Hudson, C.

2026-04-09 bioengineering 10.64898/2026.04.07.716961 medRxiv
Top 0.1%
1.0%

High-throughput functional assays are increasingly used to generate large-scale protein function datasets for protein engineering and machine learning applications. However, the utility of such datasets depends on the reproducibility of the underlying measurements. Here we report reproducible, quantitative measurements of protein sequence-to-function data at scale across two facilities. We analyze GROQ-seq (Growth-based Quantitative Sequencing) measurements of three bacterial transcription factors. Independent barcode measurements of the same sequence produce highly consistent functional estimates, demonstrating strong biological reproducibility (across all transcription factors the mean Root Mean Square Deviation [RMSD] ≈ 0.53 and mean Spearman ≈ 0.63). We also compared experiments performed at two facilities using a shared protocol, but with differing levels of automation and system integration. We observe strong agreement between measurements taken at the two sites (mean RMSD ≈ 0.41 and mean Spearman ≈ 0.730). Orthogonal tests further support this agreement: a classifier trained to distinguish data by site performs near random (AUC = 0.559), and top-ranking variants show strong statistical overlap between experiments. Together, these results demonstrate that GROQ-seq enables reproducible, scalable measurement of protein function suitable for large aggregated datasets.

6
Teaching Diffusion Models Physics: Reinforcement Learning for Physically Valid Diffusion-Based Docking

Broster, J. H.; Popovic, B.; Kondinskaia, D.; Deane, C. M.; Imrie, F.

2026-03-27 bioinformatics 10.64898/2026.03.25.714128 medRxiv
Top 0.1%
0.8%

Molecular docking aims to predict the binding conformation of a small molecule to its protein target. Recent work has proposed diffusion models for this task, from rigid-body docking that diffuses over ligand degrees of freedom to co-folding approaches that jointly generate protein structure and ligand pose. However, diffusion-based docking models have been shown to frequently produce physically implausible poses and fail to consistently recover key protein-ligand interactions. To address this, we introduce a reinforcement learning framework for training diffusion-based docking models directly on non-differentiable objectives. Fine-tuning DiffDock-Pocket for physical validity with our approach substantially increases the number of generated poses that are physically valid and interaction-preserving, with no increase in inference-time compute. Importantly, this comes without sacrificing structural accuracy; in fact, our approach increases the proportion of structures with near-native poses. These effects are most pronounced for protein targets that are dissimilar to the training data. Our fine-tuned DiffDock-Pocket model outperforms both classical docking algorithms and machine learning-based approaches on the PoseBusters set. Our results demonstrate that reinforcement learning can teach diffusion-based docking models to better respect physical constraints and recover key interactions, without the requirement to rely on inference-time corrections.

7
When Multimodal Fusion Fails: Contrastive Alignment as a Necessary Stabilizer for TCR--Peptide Binding Prediction

Qi, C.; Wang, W.; Fang, H.; Wei, Z.

2026-04-02 bioinformatics 10.64898/2026.03.31.715453 medRxiv
Top 0.1%
0.6%

Multimodal learning is commonly assumed to improve predictive performance, yet in biological applications auxiliary modalities are often imperfect and can degrade learning if fused naively. We investigate this problem in TCR-peptide binding prediction, where sequence embeddings from pretrained protein language models are strong and transferable, but structure-derived residue graphs are built from predicted folds and heuristic discretization. In this setting, structural views can be noisy, inconsistent, and difficult to optimize jointly with sequence features. We introduce TRACE, a lightweight multimodal framework that encodes each entity (TCR and peptide) with parallel sequence and graph towers, then applies CLIP-style intra-entity contrastive alignment before interaction modeling. The alignment objective regularizes representation geometry by encouraging modality consistency for the same biological entity, thereby preventing unstable graph signals from dominating fusion. Across protocol-aware TCHard RN evaluations, naive sequence+graph fusion frequently underperforms a sequence-only baseline and can collapse toward near-random behavior. In contrast, TRACE consistently restores and improves performance. Controlled noise and supervision sweeps show that these gains persist under increasing graph corruption and positive-label scarcity, indicating that alignment is especially important when training conditions are hard. Our results challenge the assumption that adding modalities is inherently beneficial. Instead, they highlight a central principle for robust multimodal bioinformatics: performance depends not only on what modalities are used, but on how their interaction is constrained during optimization. TRACE provides a simple and general recipe for leveraging imperfect structural information without sacrificing stability.
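
To make "CLIP-style intra-entity contrastive alignment" concrete, here is a minimal sketch of a symmetric InfoNCE loss between sequence and graph embeddings of the same entities. It is an illustration of the general technique, not TRACE's code: the function name `clip_alignment_loss`, the temperature value, and the toy embeddings are assumptions, and a real model would compute this on learned tower outputs with an autodiff framework.

```python
import math

def _unit(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def _mean_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_alignment_loss(seq_embeddings, graph_embeddings, temperature=0.1):
    """Symmetric InfoNCE: entity i's two views should match each other."""
    seq = [_unit(v) for v in seq_embeddings]
    graph = [_unit(v) for v in graph_embeddings]
    logits = [[sum(a * b for a, b in zip(s, g)) / temperature for g in graph]
              for s in seq]
    transposed = [list(column) for column in zip(*logits)]
    # Average the sequence-to-graph and graph-to-sequence directions.
    return 0.5 * (_mean_cross_entropy(logits) + _mean_cross_entropy(transposed))

# Toy check: matched views give a low loss, swapped views a high one.
aligned = clip_alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is minimized when each entity's sequence view is most similar to its own graph view, which is the modality-consistency pressure the abstract describes.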

8
Agentic systems are adept at solving well-scoped, verifiable problems in computational biology

Nair, S.; Gunsalus, L.; Orcutt-Jahns, B.; Rossen, J.; Lal, A.; Donno, C. D.; Celik, M. H.; Fletez-Brant, K.; Xie, X.; Bravo, H. C.; Eraslan, G.

2026-04-09 bioinformatics 10.64898/2026.04.06.716850 medRxiv
Top 0.1%
0.6%

We introduce CompBioBench, a benchmark of 100 diverse tasks for evaluating agentic systems in computational biology. Unlike mathematics and programming, which more readily admit systematic verification, biological data are inherently noisy and open to interpretation. To enable objective evaluation without reducing tasks to prescriptive checklists, we propose a new benchmark construction strategy based on synthetic/augmented data and metadata scrambling/scrubbing of real datasets to create challenging problems with a single ground-truth answer that require multi-step reasoning, tool use, bespoke code, and interaction with real-world external resources. The benchmark spans genomics, transcriptomics, epigenomics, single-cell analysis, human genetics, and machine learning workflows. Questions are curated by domain experts to cover a broad range of skills with varying difficulty. We evaluate leading general-purpose agentic systems starting from a bare-minimum environment, requiring them to fetch data and tools as needed to solve each problem. We find strong end-to-end performance, with Codex CLI (GPT 5.4) reaching 83% accuracy and Claude Code (Opus 4.6) reaching 81%. On the hardest questions, Codex CLI (GPT 5.4) reaches 59%, while Claude Code (Opus 4.6) reaches 69%. CompBioBench provides a practical testbed for measuring the progress of agentic systems in computational biology and for guiding future benchmark design.

9
ABAG-Rank: Improving Model Selection of AlphaFold Antibody-Antigen Complexes by Learning to Rank

Tadiello, M.; Ludaic, M.; Viliuga, V.; Elofsson, A.

2026-03-19 bioinformatics 10.64898/2026.03.17.712376 medRxiv
Top 0.1%
0.5%

Motivation: AlphaFold has transformed structural biology with an unprecedented accuracy in modeling protein structures and their interactions with biomolecules, with AlphaFold3 (AF3) achieving state-of-the-art performance. However, AF3 and other methods often struggle to accurately predict the structure of protein complexes that lack strong co-evolutionary information, such as antibody-antigen (Ab-Ag) complexes. One of the fundamental issues is that AF3 often generates accurate predictions, but fails to reliably distinguish them from the much larger set of incorrect ones. Results: To address this, we propose ABAG-Rank, a deep neural network that provides an efficient and robust solution for model selection of Ab-Ag interactions from a pool of structural ensembles predicted with AlphaFold. Built on the permutation-invariant DeepSets architecture, ABAG-Rank can process variable-sized ensembles of structural decoys and is directly applicable to prediction settings in which the number of candidates may vary. We train a model on a redundancy-reduced set of all known antibody-antigen complexes and find that simple geometric descriptors, along with confidence scores from AlphaFold, provide rich information about interface quality without requiring intensive physics-based calculations. Our experiments demonstrate that ABAG-Rank significantly outperforms AF3 internal scoring and the ranking performance of existing deep learning baselines. Implementation: Source code can be found at: https://github.com/tadteo/ABAG-Rank

10
CROWN: Curated Repository Of Well-resolved Noncovalent interactions

Poelmans, R.; Van Eynde, W.; Bruncsics, B.; Bruncsics, B.; Arany, A.; Moreau, Y.; Voet, A. R.

2026-04-01 bioinformatics 10.64898/2026.03.30.714168 medRxiv
Top 0.2%
0.5%

The development of machine learning models for protein-ligand interactions is fundamentally constrained by the quality and diversity of available structural data. Existing databases of protein-ligand complexes present researchers with an unsatisfying trade-off: carefully curated collections such as PDBBind and HiQBind offer high structural reliability but cover only a narrow slice of the Protein Data Bank (PDB), while large-scale resources like PLInder provide broad coverage at the expense of rigorous quality control. Here, we introduce CROWN (Curated Repository Of Well-resolved Non-covalent interactions), a machine learning-ready dataset that reconciles this tension by applying a comprehensive, fully automated preprocessing pipeline to the PLInder database. Starting from 649,915 protein-ligand interaction systems, CROWN applies a series of interleaved quality filters and processing stages addressing crystallographic resolution, ligand identity, pocket completeness, structural repair, interaction quality, and protonation at physiological pH. A distinguishing feature of the pipeline is a final constrained energy minimisation step using custom flat-bottomed restraints, which balances crystallographic evidence with relaxation of intramolecular strain. This step -- absent from existing protein-ligand datasets -- produces structurally uniform complexes by reconciling the heterogeneous refinement practices of different crystallographers and structure determination protocols, without distorting the experimentally observed binding geometry. The resulting dataset of 153,005 complexes represents a roughly four-fold increase in protein and species diversity over PDBBind and HiQBind, while maintaining rigorous structural standards.
Importantly, CROWN adopts a geometry-centric design philosophy that treats the 3D arrangement of atoms at the binding interface as a self-consistent source of information, rather than relying on externally measured binding affinities that cover only a fraction of known structures and introduce well-documented biases. We anticipate that CROWN will serve as a broadly useful resource for training generative models of protein-ligand binding poses, developing scoring functions, and benchmarking interaction prediction methods.

11
eSIG-Net: Accurate prediction of single-mutation induced perturbations on protein interactions using a language model

Pan, X.; Shrawat, A.; Raghavan, S.; Dong, C.; Yang, Y.; Li, Z.; Zheng, W. J.; Eckhardt, S. G.; Wu, E.; Fuxman Bass, J. I.; Jarosz, D. F.; Chen, S.; McGrail, D. J.; Sheynkman, G. M.; Huang, J. H.; Sahni, N.; Yi, S. S.

2026-03-31 bioinformatics 10.64898/2026.03.27.714913 medRxiv
Top 0.2%
0.5%

Most proteins exert their functions in complex with other interactors. Single mutations can exhibit a profound impact on perturbing protein interactions, leading to human disease. However, predicting the effect of single mutations on protein interactions remains a major computational challenge. Deep learning, particularly protein language models or transformers, has become an effective tool in bioinformatics for protein structure prediction. However, the functional divergence of mutations makes it difficult to predict their interaction perturbation profiles. To address this fundamental challenge, we present eSIG-Net (edgetic mutation Sequence-based Interaction Grammar Network), a novel sequence-based "Interaction Language Model" for predicting protein interaction alterations caused by single mutations. eSIG-Net combines various protein sequence embeddings, introduces a mutation-encoding module with syntax and evolutionary insights, and employs contrastive learning to evaluate mutation-induced interaction changes. eSIG-Net significantly outperforms current state-of-the-art sequence-based and structure-based prediction methods at predicting mutational impact on protein interactions. We highlight examples where eSIG-Net nominates causal variants with high confidence and elucidates their functional role under relevant biological contexts. Together, eSIG-Net is a first-in-kind "interaction language model" that can accurately predict interaction-specific rewiring by single mutations with only sequence information, and exhibits generalizability across biological contexts.

12
Residue burial encodes a protein's fold

Grigas, A. T.; Sumner, J.; O'Hern, C. S.

2026-03-31 biophysics 10.64898/2026.03.28.714986 medRxiv
Top 0.2%
0.5%

Protein structure is controlled by a high-dimensional energy landscape, which is a function of all of the atomic coordinates of the protein. Can this landscape be accurately described by a low-dimensional representation? We find that residue core identity, a binary N-dimensional encoding indicating whether each of the N amino acids in a protein is buried in the core or not, can predict the protein's backbone conformation more efficiently than all other representations that we tested. Core identity is 4 times more efficient than previous estimates of the bits per residue needed to encode a protein's native fold, 2 times more efficient than the Cα contact map, and 1.5 times more efficient than the machine-learned embeddings from FoldSeek's 3Di. Even when the folded structure is unavailable, predicting each residue's burial from sequence yields a more accurate estimate of fold quality than predicting pairwise contacts from the same sequence information. Thus, this work emphasizes that the problem of determining a protein's native fold can be re-framed as predicting each residue's core identity.

13
AI-guided design of candidate BMPR1A-binding peptides for cartilage regeneration: a multi-tool computational benchmarking study

Ahmadov, A.; Ahmadov, O.

2026-03-25 bioinformatics 10.64898/2026.03.22.713519 medRxiv
Top 0.2%
0.5%

Bone morphogenetic protein receptor type IA (BMPR1A) is a key mediator of chondrogenesis and a validated therapeutic target for cartilage repair, yet existing BMP mimetic peptides suffer from low potency and the full-length protein (rhBMP-2) carries significant safety risks. Generative AI tools for protein design can now produce de novo peptide binders, but none have been applied to cartilage regeneration targets. Here, we benchmarked four architecturally distinct AI tools--RFdiffusion, BindCraft, PepMLM, and RFpeptides--to design candidate BMPR1A-binding peptides. We generated 192 candidates alongside 98 negative controls (290 total) and evaluated all complexes using AlphaFold 3 structure prediction, dual physics-based energy scoring (PyRosetta and FoldX), and contact recapitulation against the crystallographic BMP-2:BMPR1A interface (PDB: 1REW). A four-metric composite ranking identified a 15-residue PepMLM design (pepmlm_L15_0026) as the top candidate, combining favorable binding energy (PyRosetta dG_separated = -45.9 REU; FoldX ΔG = -19.4 kcal/mol) with the highest contact recapitulation among top-ranked peptides (11/30 gold-standard interface residues). Designed candidates significantly outperformed controls on ipTM (p = 0.002) and FoldX ΔG (p < 0.001). BindCraft candidates achieved the highest structural confidence (ipTM up to 0.81) but exhibited moderate contact recapitulation (mean 0.224), consistent with the computational hypothesis that they may engage alternative BMPR1A binding surfaces rather than the native BMP-2 interface. Physicochemical filtering yielded a shortlist of 54 candidates across all four tools. These results establish a reproducible computational framework for AI-guided peptide design targeting cartilage regeneration and identify specific candidates for future experimental validation via binding assays and chondrocyte differentiation studies.
Author summary: Damaged cartilage has limited capacity to heal, and current biological therapies based on bone morphogenetic protein 2 (BMP-2) carry serious safety concerns including ectopic bone formation and inflammation. Short peptides that mimic BMP-2's interaction with its receptor BMPR1A could offer a safer, more targeted alternative, but designing such peptides from scratch is challenging. We used four different artificial intelligence tools--each employing a distinct computational strategy--to generate 192 candidate peptides designed to bind BMPR1A. We then evaluated all candidates using multiple independent computational methods to assess binding quality, energy favorability, and whether each peptide targets the correct site on the receptor. Our analysis identified a shortlist of 54 promising candidates, with a 15-residue peptide from the language model-based tool PepMLM emerging as the top-ranked design. We also found evidence that one tool (BindCraft) may produce peptides that bind BMPR1A at sites different from the natural BMP-2 interface, highlighting the importance of validating not just whether a peptide binds, but where it binds. Our computational framework and candidate peptides provide a foundation for future laboratory testing toward cartilage repair therapies.

14
Structural basis for saccharide binding by human RNase 2/EDN, a protein combining enzymatic and lectin properties

Kang, X.; Prats-Ejarque, G.; Boix, E.; Li, J.

2026-03-23 biochemistry 10.64898/2026.03.20.713198 medRxiv
Top 0.2%
0.4%

Human RNase 2 (eosinophil-derived neurotoxin, EDN) is a major eosinophil granule protein of the vertebrate-specific RNase A superfamily and is involved in antiviral response and inflammation. Identifying ligand-binding pockets in EDN is thus relevant to structure-based drug design. In our laboratory we identified by protein crystallography a conserved site at the protein surface binding to carboxylic anion molecules (malonate, tartrate and citrate). Searching for potential biomolecules rich in anion groups and considering previous report of EDN binding to glycosaminoglycans, we explored the protein binding to saccharides. Next, EDN crystals were soaked with mono- and disaccharides, and the 3D structures of ten complexes were solved by X-ray crystallography at atomic resolution. We identified protein binding pockets to glucose, fucose, mannose, sucrose, galactose, trehalose, N-acetyl-D-glucosamine, N-acetylmuramic acid, and the sialic acid N-acetylneuraminic acid. A main site for glucose, fucose, and galactose was located adjacent to the spotted carboxylic anion site. Secondarily, N-acetylneuraminic acid, N-acetylmuramic acid, sucrose, galactose, and mannose shared another protein surface region. Overall, the saccharides clustered into seven defined sites, outlining a conserved recognition pattern, which was further analysed by molecular modelling. Interestingly, within the RNase A family, we find amphibian RNases that were initially isolated as carbohydrate binding proteins and named as leczymes, combining enzymatic and lectin properties. The present data is the first systematic structural characterization of a mammalian sugar-binding RNase within the family. The results highlight unique EDN residues that mediate its sugar specific interactions, of particular interest for a better understanding of the protein physiological role. 
Highlights: structure of RNase 2 in complex with mono and disaccharides at atomic resolution; identification of RNase 2 unique sugar binding sites; characterization of a mammalian RNase A family enzyme with lectin properties.

15
GEF me a break: the consequences of freezing Rho guanine-nucleotide exchange factor catalytic domains

Anderson, L. K.; Barpal, E.; Mendoza, H.; Cash, J. N.

2026-04-09 biochemistry 10.64898/2026.04.08.717323 medRxiv
Top 0.2%
0.4%

Purified proteins are routinely flash frozen for use in functional and structural studies, providing a convenient way to reproduce results across complex experiments. Rho guanine-nucleotide exchange factors (RhoGEFs) are no exception to this practice, yet the effects of freezing on their activity and stability remain largely uncharacterized. This gap potentially affects the characterization of these important enzymes and how results are interpreted with respect to their prospective use as therapeutic targets. Here, we tested the isolated DH/PH tandems of P-Rex1, P-Rex2, and PRG under different cryoprotectant conditions and monitored activity and thermostability over time after flash freezing. Our results show a clear divergence between the activity of fresh and frozen purified RhoGEF protein samples in as little as one week for some conditions. Specifically, the variability in data collected on frozen samples was greatly increased. Despite these differences, thermostability seems to be preserved for much longer timepoints across RhoGEFs. Moreover, despite eventual changes in both activity and thermostability with respect to freezing, there are no obvious changes in global conformation between fresh and frozen samples of the isolated P-Rex2 DH/PH tandem. From our data, there are few generalizable trends between the different RhoGEFs and no single cryoprotective agent tested was a silver bullet to preserve both activity and thermostability across RhoGEFs. Overall, our findings emphasize the unpredictable effects of freezing RhoGEFs. As such, RhoGEF freezing should be carefully characterized for each protein and critically viewed when comparing analyses between different studies.

16
A conserved isoleucine gates the diffusion of small ligands to the active site of NiFe CO-dehydrogenase

Opdam, L.; Meneghello, M.; Guendon, C.; Chargelegue, J.; Fasano, A.; Jacq-Bailly, A.; Leger, C.; Fourmond, V.

2026-03-21 biochemistry 10.64898/2026.03.19.713016 medRxiv
Top 0.3%
0.3%

CO dehydrogenases (CODH) are metalloenzymes that reversibly oxidize CO to CO2, at a buried NiFe4S4 active site. The substrates, CO and CO2, need therefore to be transported through the protein matrix to reach the active site. The most likely pathway for intra-protein diffusion is the hydrophobic channel identified in the crystal structures. Here, we use site-directed mutagenesis to study the highly conserved isoleucine 563 of Thermococcus sp. AM4 CODH2. Mutations at this position change the biochemical properties (KM for CO, product inhibition constant, catalytic bias...), and increase the resistance of the enzyme to the inhibitor O2, showing that isoleucine 563 indeed lines the gas channel. The I563F mutation decreases the bimolecular rate constant of inhibition by O2 15-fold, and increases the IC50 20-fold, which is the strongest improvement in O2 resistance reported so far. We show that the size of the introduced amino acids is less important than their flexibility - along with the size of the cavity formed near the active site in the channel. We also conclude that O2 access to the active site cannot be slowed down without also affecting CO diffusion. This tradeoff will have to be considered in further attempts to use site-directed mutagenesis to make CODHs more O2 tolerant.

17
Structure of human aldehyde oxidase under tris(2-carboxyethyl)phosphine-reducing conditions

Videira, C.; Esmaeeli, M.; Leimkuhler, S.; Romao, M. J.; Mota, C.

2026-03-25 biochemistry 10.64898/2026.03.25.713928 medRxiv
Top 0.3%
0.3%

The importance of human aldehyde oxidase (hAOX1) has increased over the last decades due to its involvement in drug metabolism. Inhibition studies concerning hAOX1 are extensive, and a common reducing agent, dithiothreitol (DTT), was recently found to inactivate the enzyme. However, in previous crystallographic studies of hAOX1, DTT was found to be essential for crystallization. To address this concern, another reducing agent was used in crystallization trials. Using tris(2-carboxyethyl)phosphine (TCEP), a sulphur-free reducing agent, it was possible to obtain well-ordered crystals of wild-type hAOX1 and a variant, hAOX1_6A, which diffracted beyond 2.3 Å. Instead of the typical star-shaped crystals of hAOX1, at pH 4.7 plates are obtained in the orthorhombic space group (P22₁2₁) with two molecules in the asymmetric unit. Activity assays with the enzyme incubated with both reducing agents show that, contrary to DTT, TCEP does not lead to irreversible inactivation of the enzyme. The replacement of DTT with TCEP in crystallization of hAOX1 provides a strategy to circumvent enzyme inactivation during crystallographic studies, allowing future applications of new assays, such as time-resolved crystallography.

18
Automated Knowledge Graph Construction for CAR T Cell Receptor Design via Hybrid Text Mining

Luo, H.; Tang, D.; Zivanov, A.; Miskov-Zivanov, N.

2026-04-07 synthetic biology 10.64898/2026.04.06.716719 medRxiv
Top 0.3%
0.3%

Designing next-generation Chimeric Antigen Receptors (CARs) requires a systematic understanding of intracellular signaling domains and their downstream biological effects, yet no comprehensive knowledge resource currently exists for this purpose. Here, we present an automated workflow that integrates multiple natural language processing and large language model tools to extract biomolecular interactions from PubMed literature and assemble them into a CAR T cell signaling knowledge graph. Our pipeline combines REACH, INDRA, and Llama 3 across 15 targeted search queries, yielding a directed multi-relational graph of ~7,500 unique interactions among ~1,800 entities, including proteins, biological processes, and chemicals. We further demonstrate that queries incorporating biological process ontology terms retrieve more interaction-rich papers than protein-name-only searches, offering practical guidance for future literature mining efforts. The resulting knowledge base provides a structured foundation for predicting T cell phenotypes and prioritizing intracellular domain candidates for CAR design, with broader applicability to knowledge-driven inference in immunotherapy research.
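
For readers unfamiliar with the term, a directed multi-relational graph can be sketched as a set of (source, relation, target) triples in which the same pair of entities may be linked by several relation types. The snippet below is only an illustration under stated assumptions: the class `InteractionGraph`, the entity names (`ProteinA`, `ProteinB`, `ProcessX`), and the relation labels are hypothetical placeholders, not the authors' data model or the output of REACH, INDRA, or Llama 3.

```python
from collections import defaultdict


class InteractionGraph:
    """Hypothetical directed multi-relational graph stored as typed edges."""

    def __init__(self):
        # source entity -> set of (relation, target) pairs; a set keeps
        # each interaction unique even if it is extracted more than once
        self._edges = defaultdict(set)

    def add(self, source, relation, target):
        self._edges[source].add((relation, target))

    def targets(self, source, relation=None):
        # All targets of `source`, optionally restricted to one relation type
        return sorted(
            t for r, t in self._edges.get(source, ())
            if relation is None or r == relation
        )

    def num_interactions(self):
        return sum(len(edges) for edges in self._edges.values())

    def entities(self):
        nodes = set(self._edges)
        for edges in self._edges.values():
            nodes.update(t for _, t in edges)
        return nodes


# Placeholder triples, as if extracted from different papers
g = InteractionGraph()
g.add("ProteinA", "activates", "ProteinB")
g.add("ProteinA", "inhibits", "ProteinB")   # second relation, same pair
g.add("ProteinA", "activates", "ProteinB")  # duplicate, stored once
g.add("ProteinB", "regulates", "ProcessX")
```

Counting unique triples and unique nodes in such a structure corresponds to the "unique interactions" and "entities" figures quoted in the abstract.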

19
Engineering a bifunctional alfa and beta hydrolase from a GH1 beta-glycosidase

Otsuka, F. A. M.

2026-03-20 bioengineering 10.64898/2026.03.19.712844 medRxiv
Top 0.3%
0.3%

Glycoside hydrolases (GHs) play central roles in carbohydrate metabolism and are widely exploited for industrial and biomedical applications. However, they are often not optimal for applications due to their constrained function and strict stereochemical specificity, necessitating the discovery and optimization of distinct enzymes for each glycosidic configuration. Members of glycoside hydrolase family 1 (GH1) are archetypal retaining β-glycosidases, while α-specific activity is rare within this family. Here, I demonstrate that a retaining GH1 enzyme can be engineered to hydrolyze both β- and α-configured substrates without altering its canonical catalytic residues. Using a well-characterized β-glycosidase and computational protein design strategies targeting second-shell residues surrounding the active site, a bifunctional β-/α-glycosidase containing 45 mutations was generated. The engineered variant acquired the ability to hydrolyze the α-configured substrate 4-nitrophenyl-α-D-glucopyranoside while retaining activity toward the original β-substrates, albeit with reduced catalytic efficiency and thermostability. Structural modeling and docking analyses reveal that the engineered enzyme preserves the original fold and accommodates substrates within the catalytic pocket in a similar manner to the wild type. These findings provide direct evidence that stereochemical constraint in retaining GHs is more flexible than previously appreciated and can be modulated through targeted engineering.

20
Evaluating FoldX5.1 for MAVISp Stability Data Collection

Vliora, A.; Tiberti, M.; Papaleo, E.

2026-04-02 bioinformatics 10.64898/2026.03.31.715598 medRxiv
Top 0.4%
0.2%

MAVISp (Multi-layered Assessment of VarIants by Structure for proteins) is a structure-based framework for facilitating mechanistic interpretation of missense variants, with protein stability as one of its core analytical layers. When software tools are updated, a key consideration for database curation is whether the new version can be adopted without compromising compatibility with existing entries. This study evaluated the effect of replacing FoldX5 with FoldX5.1 on the results of the MAVISp stability workflow. We compared predicted changes in folding free energy for 539,809 shared variants across 119 proteins. We found high overall agreement, with a mean Pearson correlation of 0.933 and a mean Cohen's κ coefficient of 0.814. Most proteins showed strong concordance, whereas only three (NUPR1, TSC1, and TMEM127) showed poor agreement. The number of disagreements was higher at sites with low AlphaFold2 confidence for NUPR1 and TSC1. These outliers did not display systematic inter-version bias, as mean shifts in folding free energies between versions were minimal. Collectively, these findings support adopting FoldX5.1 for future MAVISp data collection. We will include a transition period, during which existing entries retain FoldX5 annotations until their scheduled annual update, while new or updated entries are processed with FoldX5.1. To facilitate this transition, the FoldX software version has been added as a new metadata annotation in the MAVISp database.
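
The two agreement measures mentioned in this abstract can be illustrated with a short sketch. It assumes that the Pearson correlation is computed on the raw predicted folding free energy changes from the two software versions and that the second coefficient is Cohen's kappa computed on categorical labels derived from those values. The function names, the 3.0 cutoff in `classify`, and the two-class labelling are hypothetical choices for illustration, not the MAVISp classification scheme.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def classify(ddg, threshold=3.0):
    # Hypothetical cutoff, chosen only to turn a number into a label
    return "destabilizing" if ddg >= threshold else "other"


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both label sets were assigned independently
    expected = sum(
        count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()
    ) / n ** 2
    return (observed - expected) / (1 - expected)


# Made-up values standing in for per-variant predictions of one protein
old_version = [0.5, 3.2, 4.1, -0.3, 1.0, 3.5]
new_version = [0.6, 3.0, 4.5, -0.1, 2.9, 3.6]

r = pearson(old_version, new_version)
kappa = cohen_kappa(
    [classify(v) for v in old_version],
    [classify(v) for v in new_version],
)
```

Averaging `r` and `kappa` over proteins would give per-dataset means of the kind reported above. Note that kappa is undefined when both label sets contain a single identical class, a case this sketch does not handle.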